In the last few years, there has been a lot of research into the use of machine
learning for speech recognition applications. However, applications to
develop and evaluate air traffic controllers’ communication skills in
emergency situations have not been addressed so far. In this study, we
propose a new automatic speech recognition system using two
architectures. The first architecture uses convolutional neural networks and
gave satisfactory results: 96% accuracy and 3% error rate on the training
dataset. The second architecture uses recurrent neural networks and gave very
good results in terms of sequence prediction: 99% accuracy and e^{-7}% error
rate on the training dataset. Our intelligent communication system (ICS) is
used to evaluate aeronautical phraseology and to calculate the response time
of air traffic controllers during their emergency management. The study was
conducted at the International Civil Aviation Academy, with third-year air
traffic control engineering students. The results of the trainees' performance
demonstrate the effectiveness of the system. The instructors also appreciated the
instantaneous and objective feedback.
1. INTRODUCTION

Maintaining and improving human performance can only be achieved by focusing training on the
skills people need to perform their duties safely and effectively, and by exposing them to a variety of scenarios
that cover the most relevant threats and errors in their environment. Human error is frequently
cited in air accident investigation reports as a major cause of accidents and serious incidents, despite the
evolution of the technologies and safety systems used [1]. Safety is therefore only possible through
practical training that makes errors less probable and their consequences less serious. When an aviation
emergency is declared, it is mandatory to think quickly and act immediately [2]. However, the need to
communicate effectively and in a timely manner, as well as the lack of qualified personnel and time, causes
stress that impairs the air traffic controller's situational awareness and decision making and can lead to
serious incidents: 60% of communication errors between pilots and controllers are the cause of accidents or
incidents [3]. According to a study of NASA's Aviation Safety Reporting System (ASRS) database, incorrect
controller-pilot communication is a causal factor in 80% of aviation incidents or accidents, while late
communication accounts for 12% of the causes leading to incidents or accidents, as shown in Table 1 [4].

Table 1. Communication factors [4]

Factor                            Percentage of reports
Incorrect communication           80%
Absence of communication          33%
Correct but late communication    12%


In order to reduce communication errors and ensure redundancy of communications between the 
controller and the pilot, the International Civil Aviation Organization (ICAO) has established a confirmation 
and correction process as a defence against communication errors [4], as shown in Figure 1. However, in
an abnormal or emergency situation (ABES), where every second counts, this communication loop can only 
be effective if the aeronautical phraseology used is correct and standard. Communication errors and the waste 
of time repeating messages in this kind of stressful situation can have tragic consequences. 


Figure 1. Controller-pilot communication loop [4]


Traditionally, the performance of trainee air traffic controllers is assessed on simulators, which requires
the presence of pseudo-pilots. Figure 2 illustrates the whole controller/pseudo-pilot communication process
and associated devices [5]. However, the evaluation of performance during an emergency or abnormal 
situation (aircraft engine failure in our scenario) should not only determine whether the communication has 
been made but also whether the aeronautical phraseology is used correctly and in a timely manner. This can 
only be achieved by designing new systems that can perform an instantaneous and objective assessment of air 
traffic controllers’ verbal communication. 

Figure 2. Controller/Pseudo-pilot communication [5]

Deep learning algorithms have mainly been used to improve the computer's ability to
understand human behaviour, including speech recognition [6]. With the introduction of artificial intelligence
[7], [8], speech recognition has received a lot of attention in recent years and is proving to be an
excellent tool for the instantaneous analysis of phraseology in an air traffic controller (ATC) simulator
environment, where an automatic speech recognition device replaces the pseudo-pilots [9]. It is therefore
worthwhile to implement new interactive systems based on automatic speech recognition that allow the
evaluation of air traffic controllers’ communication skills, especially when they are confronted with stressful
situations [10]. The present study thus proposes a new intelligent communication system based on
automatic speech recognition that recognises, in real time, the phraseology errors made by student air
traffic controllers when faced with ABES.

The rest of this paper is organised as follows: section 2 gives an overview of the
proposed system and presents the creation of an intelligent speech recognition architecture using
convolutional neural networks (CNN) and recurrent neural networks (RNN). The results and performance are
described in section 3. Finally, section 4 concludes our research work.


2. RESEARCH METHOD 
2.1. Overview of the proposed system 

ATC are trained to use standard and correct ICAO
phraseology. However, it has been observed that many air traffic controllers can work for long periods
without being exposed to ABES. This lack of practice alters the aeronautical phraseology they use without
their being aware of it. To address this, it was decided to develop a new
system based on automatic speech recognition technology that allows interaction between the student and the
machine without the need for a second person to perform the pseudo-pilot task. The speech recognition
function serves as a basic tool for improving the quality of the evaluation, and the student's request to move
on to the next phase is processed instantaneously. In addition, a timing function has been integrated into the
proposed system in order to determine the overall duration of the performance. The process chain is
illustrated in Figure 3.

Figure 3. Flowchart describing


The following rules will be used to assess performance. It should be noted that this is an aircraft 
engine failure event: 
- The emergency is activated when the term "Emergency" is detected; this expression is the event that
triggers the emergency situation.
- Once an emergency has been activated, a chronometer is started to measure the time spent on the whole
emergency exercise. Time pressure is an essential element of emergency management. Wasting time
repeating messages and searching for the correct phraseology reduces the efficiency of the
air traffic controller, causes misunderstandings and undermines pilot confidence in the service
provided.

- The input to the system is the student's speech and a list of predicted phraseology from the newly 
designed phraseology corpus. The corpus contains (so far) thirty-six transcripts from twelve students 
(seven female and five male) who were asked to participate in three different scenarios: i) loss of 
separation between two aircraft during initial climb due to engine failure; ii) failure of ground-to-air 
communication; and iii) overflying a restricted area due to bad weather conditions. The students’ 
communications were conducted in English, the official and most widely used language in aviation [11]. 

- The student's speech is compared to the word sequences in the corpus to detect errors. If the answer is not 
accepted, the student is invited to try again. 

- Each transition between two phases is triggered by detecting the term "check".

- The chronometer will always remain on while the practical exercise is in progress. The total time of the 
simulation will be a decisive factor in assessing the student's performance. 
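
The rules above can be sketched as a small state machine. This is only an illustration under stated assumptions, not the system described in this paper: the class name `EmergencySession`, the `expected_phrases` argument, the exact-match comparison against one expected phrase per phase, and the requirement that an answer be accepted before "check" advances the phase are all hypothetical choices. Recognised speech is assumed to arrive as plain text.

```python
import time


class EmergencySession:
    """Hypothetical sketch of the evaluation rules (all names are illustrative)."""

    def __init__(self, expected_phrases, clock=time.monotonic):
        self.expected = [p.lower() for p in expected_phrases]  # one phrase per phase
        self.clock = clock        # injectable so the timing can be tested
        self.start = None         # set when "emergency" is detected
        self.end = None           # set when the last phase is completed
        self.phase = 0            # index of the current phase
        self.answered = False     # True once the current phase was accepted
        self.errors = 0

    def hear(self, utterance):
        """Process one recognised utterance and return a feedback string."""
        text = utterance.lower().strip()
        if self.start is None:
            if "emergency" in text.split():
                self.start = self.clock()      # start the chronometer
                return "emergency activated"
            return "waiting for emergency"
        if self.end is not None:
            return "exercise finished"
        if text == "check":
            if not self.answered:
                return "answer the current phase first"
            self.phase += 1
            self.answered = False
            if self.phase == len(self.expected):
                self.end = self.clock()        # stop the chronometer
                return "exercise finished"
            return "next phase"
        if text == self.expected[self.phase]:
            self.answered = True
            return "accepted"
        self.errors += 1                       # answer rejected: try again
        return "try again"

    def elapsed(self):
        """Seconds since activation (0.0 before the emergency is declared)."""
        if self.start is None:
            return 0.0
        stop = self.end if self.end is not None else self.clock()
        return stop - self.start
```

A session would be fed one recognised utterance at a time through `hear`, and `elapsed` and `errors` would then be read once the exercise is finished. A real system would likely need a tolerant comparison rather than exact string equality.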


2.2. Data 
2.2.1. Feature extraction

Poor quality or erroneous data can make it difficult to extract information and can lead to wrong
predictions, so data must be properly collected and prepared. Feature extraction is generally referred to as
front-end signal processing [12]. Feature extraction techniques typically produce a multidimensional feature
vector for each speech signal [13]. Speech features play an essential role in separating one
speaker from another [14]. Feature extraction reduces the size of the speech signal in a way
that preserves its discriminative power [15]. In our research, this is the first step that each recording file
goes through. It consists of transforming one-dimensional audio data into three-dimensional
spectrograms after extraction of the feature vectors of each speech signal.

There are a variety of options for representing the speech signal for the recognition
process; the Mel-frequency cepstral coefficient (MFCC) representation is the most popular [16]. A particularity
of the MFCC transformation is that accuracy can be improved by increasing the size of the acoustic
feature vectors or their number, that is, by increasing the number of MFCC coefficients.


2.2.2. MFCC 

MFCC are cepstral coefficients calculated by a discrete cosine transform applied to the signal's
power spectrum. The frequency bands of this spectrum are spaced along the Mel scale, which is approximately
logarithmic in frequency. The MFCC computation imitates the human auditory system: it is an artificial
implementation of the working principle of the ear, on the assumption that the human ear provides a reliable
means of speaker recognition [17]. In our research, these coefficients are obtained by the following stages,
as illustrated in Figure 4 [15]:
- Cut the signal into "frames".
- Apply the Fourier transform to the acoustic signal corresponding to each frame to obtain the frequency
spectrum of each frame.
- Apply a logarithmic (Mel-scale) filter to the obtained spectrum, as given in (1):

Mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)    (1)

- Apply a discrete cosine transform to the result to obtain the cepstral coefficients.

Figure 4. Mel-frequency cepstral coefficient [15]
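
The four stages listed above can be sketched in NumPy as follows. This is a minimal illustration, not the pipeline used in this study: the frame length, hop size, FFT size, number of filters, number of coefficients and the Hamming window are assumed values, and the function names are hypothetical. The logarithm of the band energies is taken before the cosine transform, as is usual for MFCC.

```python
import numpy as np


def hz_to_mel(f):
    # Mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the Mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):            # rising slope
            bank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):           # falling slope
            bank[i - 1, k] = (right - k) / (right - centre)
    return bank


def mfcc(signal, sample_rate, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_coeffs=13):
    # 1) Cut the signal into overlapping frames and window each frame.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2) Fourier transform of each frame, converted to a power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) Mel-spaced filterbank, then logarithm of the band energies.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # 4) Discrete cosine transform (DCT-II) of the log energies.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T                # shape: (n_frames, n_coeffs)
```

Calling `mfcc(signal, 16000)` on a one-dimensional array of samples returns one row of coefficients per frame. Raising `n_coeffs` enlarges each feature vector, which is the knob discussed in section 2.2.1.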

The idea is to have a correspondence between the set of transformed speech samples and the set of words
that we want our system to be able to identify, namely the expression "emergency", which triggers the
emergency exercise, and the expression "check", which marks the transition between two successive
steps of the emergency management procedure (engine failure in our example). In addition, to keep the
system from reacting to noise, a third class has been added that groups a number of noise patterns that
may exist in the working environment of an ATC simulator. This set is completed by a set of
"Labels" consisting of the images of the words corresponding to each speech sample under the mapping ~. The
resulting bijection between the set of "Labels" and the classes i) emergency, ii) check, and iii) noise is represented in (2):


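\sim^{-1} : \left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\} \to \{\text{emergency}, \text{check}, \text{noise}\}    (2)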


Our training dataset is now complete. It is formalised in (3):


dataset = \left\{ \left( \left[ \bigl[\mathrm{Mel}(f_t)\bigr]_{t}^{n_{mel}} \right]_k,\ \sim(g_k) \right) \ \text{such that}\ k \in [0;\ \mathrm{size} - 1] \right\}    (3)

f_t : Frequency of frame t;
n_mel : Number of frames treated;
g_k : Grammar corresponding to the k-th element.
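
As a small illustration of the label mapping and of the dataset pairs, the following sketch encodes the three classes as one-hot vectors and pairs each feature array with its label. The function names and the use of NumPy arrays are assumptions made for the example, not part of the system described here.

```python
import numpy as np

CLASSES = ("emergency", "check", "noise")


def to_label(word):
    """The mapping ~ : word -> one-hot label."""
    label = np.zeros(len(CLASSES))
    label[CLASSES.index(word)] = 1.0
    return label


def from_label(label):
    """The inverse mapping: one-hot label -> word."""
    return CLASSES[int(np.argmax(label))]


def build_dataset(features, words):
    """Pair the k-th feature array with the label of the k-th word."""
    return [(x, to_label(g)) for x, g in zip(features, words)]
```

Because the mapping is a bijection, `from_label(to_label(w))` returns `w` for each of the three classes.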


2.3. Creation of an intelligent speech recognition architecture 

In the last few years, the performance of deep learning algorithms has surpassed that of traditional
machine learning algorithms. The most commonly used deep learning algorithms in the field of speech
recognition are RNNs and CNNs [18]. CNNs have many applications in video and image recognition and
recommendation systems [18], [19]. Mathematically, a convolution is the combination of two functions to
obtain a third function. In a CNN, the inputs are reduced to a more compact form without loss of the relevant
features, which reduces the computational complexity and increases the success rate of the algorithm [20].
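
To make the definition of convolution concrete, here is a minimal one-dimensional example in NumPy. The signal and kernel values are arbitrary and the helper name `convolve_valid` is hypothetical; it only illustrates how two sequences combine into a third, shorter one.

```python
import numpy as np


def convolve_valid(signal, kernel):
    """Discrete 1-D convolution, keeping only fully overlapping positions."""
    flipped = kernel[::-1]                       # convolution flips the kernel
    n_out = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], flipped)
                     for i in range(n_out)])


signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])
result = convolve_valid(signal, kernel)          # array([2., 2., 2.])
```

The output has `len(signal) - len(kernel) + 1` elements, which is why an unpadded convolutional layer produces an output slightly smaller than its input. Note that deep learning frameworks usually apply the kernel without flipping it (cross-correlation); since the kernel is learned, the distinction does not matter in practice.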

RNNs are a family of neural networks specialised in processing sequential data. They retain information
about the inputs already received and use it to predict what will follow. Due to their nature, RNNs are
successfully applied to sequential data such as time series, speech, video, and text [21]. Long short-term
memory (LSTM) networks are gated RNNs that give the network access to long-term memory. They provide
the most efficient models used in practical applications and solve the long-term dependency problem of
RNNs [22].

During the project, we tested different architectures, which gave different results. The first
architecture uses CNNs and gave satisfactory results: 96% accuracy and 3% error rate on the training dataset.
The second is a recurrent approach, using RNNs, which is notably good in terms of sequence prediction:
99% accuracy and e^{-7}% error rate on the training dataset. The architectures used for the two kinds of
models are shown in Figure 5.


2.4. Training 

After creating the model's architecture, the collected data are used to find values for the parameters of
the model that allow it to perform its recognition task properly. In our case, the model is trained
using a well-known optimization technique: gradient descent [23], [24]. A neural network is a set
of computing stages through which formatted data pass and are transformed in order to extract features.
This transformation is done by means of weights and by the successive application of two mathematical
operations: a linear one and a non-linear one [25]. The weights of the model are first initialised with
random values; then, during training, the output of the model is calculated and compared to the expected
value (from the dataset). Equation (4) gives the error terms obtained by differentiating the error function with
respect to each weight [26].


\forall i, j : \quad e_{ij} = |\Delta w_{ij}| = \left| \frac{\partial E}{\partial w_{ij}} \right|    (4)

E : Error function;
w_ij : Weights of the neural network;
i, j : Error indices.

Thus, the network adjusts its weights after each data sample until a value closer to the ideal is obtained. This
learning process is in fact the gradient descent algorithm, which works as follows [27]:
For each batch of data do:
    V = prediction_model(batch)
    U = Label_correct(batch)
    Err = E(V, U)
    For i, j indices of the Err matrix do:
        e_ij = |∂E/∂w_ij|
    ε = [e_ij]_{i,j}    # weight correction matrix
    Correct_weights(ε)
Next batch
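
The loop above can be made concrete with a short NumPy sketch. To stay self-contained it trains a single linear layer followed by a softmax with a cross-entropy error, which is an assumption made for illustration and far simpler than the networks of Figure 5; the learning rate, batch size and function names are likewise hypothetical. The signed gradient is used for the weight update.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)         # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train(x, labels, lr=0.5, epochs=200, batch_size=4, seed=0):
    """Mini-batch gradient descent for one linear layer followed by softmax."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=(x.shape[1], labels.shape[1]))  # random init
    for _ in range(epochs):
        for start in range(0, len(x), batch_size):
            xb = x[start:start + batch_size]
            ub = labels[start:start + batch_size]    # U = correct labels
            v = softmax(xb @ w)                      # V = model prediction
            grad = xb.T @ (v - ub) / len(xb)         # dE/dw_ij, E = cross-entropy
            w -= lr * grad                           # correct the weights
    return w
```

After training, `softmax(x @ w).argmax(axis=1)` gives the predicted class index for each row of `x`.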

Layer (type)                    Output Shape            Param #
conv2d_1 (Conv2D)               (None, 124, 220, 64)    1664
max_pooling2d_1 (MaxPooling2    (None, 62, 110, 64)     0
conv2d_2 (Conv2D)               (None, 60, 108, 32)     18464
max_pooling2d_2 (MaxPooling2    (None, 30, 54, 32)      0
dropout_1 (Dropout)             (None, 30, 54, 32)      0
conv2d_3 (Conv2D)               (None, 28, 52, 16)      4624
max_pooling2d_3 (MaxPooling2    (None, 14, 26, 16)      0
dropout_2 (Dropout)             (None, 14, 26, 16)      0
flatten_1 (Flatten)             (None, 5824)            0
dense_1 (Dense)                 (None, 10)              58250
dropout_3 (Dropout)             (None, 10)              0
dense_2 (Dense)                 (None, 7)               77
dense_3 (Dense)                 (None, 3)               24

Total params: 83,103
Trainable params: 83,103
Non-trainable params: 0


Layer (type)                    Output Shape            Param #
conv1d_1 (Conv1D)               (None, 128, 128)        143488
conv1d_2 (Conv1D)               (None, 128, 64)         24640
conv1d_3 (Conv1D)               (None, 128, 128)        41088
conv1d_4 (Conv1D)               (None, 128, 128)        49280
conv1d_5 (Conv1D)               (None, 128, 128)        49280
bidirectional_1 (Bidirection    (None, 128, 40)         23840
flatten_1 (Flatten)             (None, 5120)            0
dense_1 (Dense)                 (None, 7)               35847
dense_2 (Dense)                 (None, 3)               24

Total params: 367,487
Trainable params: 367,487
Non-trainable params: 0

Figure 5. The respective architectures of the CNN and RNN 
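
The parameter totals reported in Figure 5 (83,103 and 367,487) can be checked with simple arithmetic. The sketch below is a consistency check only: the kernel sizes, the number of input channels, and the reading of the bidirectional layer as two LSTMs of 20 units each are assumptions inferred from the output shapes and totals, since the figure does not state them.

```python
def conv_params(kernel_taps, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_taps * in_channels + 1) * filters


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((n_in + units) * units + units)


cnn_total = (
    conv_params(5 * 5, 1, 64)            # first Conv2D: assumed 5x5, 1 channel
    + conv_params(3 * 3, 64, 32)         # second Conv2D: assumed 3x3
    + conv_params(3 * 3, 32, 16)         # third Conv2D: assumed 3x3
    + dense_params(14 * 26 * 16, 10)     # flattened 14 x 26 x 16 feature map
    + dense_params(10, 7)
    + dense_params(7, 3)
)

rnn_total = (
    conv_params(5, 224, 128)             # assumed kernel 5, 224 input channels
    + conv_params(3, 128, 64)            # assumed kernel 3
    + conv_params(5, 64, 128)            # assumed kernel 5
    + conv_params(3, 128, 128)
    + conv_params(3, 128, 128)
    + 2 * lstm_params(128, 20)           # assumed bidirectional LSTM, 20 units
    + dense_params(128 * 40, 7)          # flattened 128 x 40 sequence
    + dense_params(7, 3)
)
```

Under these assumptions `cnn_total` and `rnn_total` reproduce the two totals, which suggests the first model takes 128 x 224 single-channel spectrograms as input and the second treats the same data as sequences of 128 steps with 224 features.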

3. RESULTS AND DISCUSSIONS

Our system can be considered a real learning environment for the development of communication
competence, for two main reasons: first, the analysis of phraseology is instantaneous thanks to speech
recognition; second, the temporal constraint on performance is integrated, which simulates
the time pressure present in an abnormal or emergency situation. Such a system
provides a safe and non-threatening practice environment that tolerates trial and error. Most of the students
did not communicate well in their first attempts: there were errors, disfluencies and hesitations in the
messages conveyed, as well as delays in communication. Some students reported feeling stressed at the beginning
of the scenario because they had to use the correct phraseology and deal with an emergency situation at the
same time. Figure 7 shows the average performance of a student for a single scenario repeated four times
(engine failure in our example), in which the student has to use correct aviation phraseology in the six-step engine
failure management checklist illustrated in Figure 6 [28].


[Diagram of the six-step checklist; the only step label recovered is "Acknowledge emergency"]


Figure 6. Engine failure checklist [28]


The different patterns in Figure 7 indicate the student's performance in the six steps of the engine
failure management exercise. The student's poor communication performance in the first two trials can be
attributed to initial fear, use of incorrect aviation phraseology and unfamiliarity with the system. However,
from the third repetition onwards, the student started to become familiar with the system and showed
significantly better performance.


[Plot with axes "Steps of the checklist" and "Repetition"]


Figure 7. Performance monitoring


The students found the ICS effective for developing the ability to produce correct aeronautical
phraseology, even under stress. However, the assessment of nonverbal communication, such as the speaker's
posture and voice, was not possible. Some students stressed the relevance of nonverbal communication in the
communication process.

As there are few qualitative studies on the use of speech recognition to train soft skills, especially 
communication skills in the context of air traffic control, this study could add value to future research. 
However, one limitation of this study is that the speech recognition model will need even more data to 
achieve a higher accuracy value. Thus, our first recommendation for the future will be to spend much more 
time on data collection, data augmentation and data processing, in order to obtain a rich and high quality 
database. 

4. CONCLUSION

In order to ensure proper management of emergency situations, air traffic controllers must be
prepared to deal with multiple pieces of information simultaneously by listening, understanding and using
correct and standard aeronautical phraseology. In this study, we proposed an ICS based on automatic speech
recognition that allows interaction between the student and the machine without the need for a second
student to perform the task of the pseudo-pilot. In addition, a function that calculates the time taken to
transmit the instructions and clearances issued by student air traffic controllers during emergency management was
incorporated into the system. Through immediate practice and repetition, the students were able to develop
effective and efficient communication that facilitates emergency management. However, a limitation of this
study is that the speech recognition model still needs more data. As the research is still in the
development phase, future work will focus on developing a scalable training system that allows new
scenarios to be injected.